Multi-task learning for Joint Language Understanding and Dialogue State Tracking

2018-11-23

DST, NLP, NLU

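This post covers a paper from SIGdial on jointly modeling LU and DST. The joint model improves computational efficiency without hurting accuracy, and it introduces scheduled sampling during training, which is its other novel contribution. It is the latest in the authors' series of papers on LU and DST and is best read alongside the two earlier ones (see Reference).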
paper link
dataset link

Introduction

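A task-oriented dialogue system usually contains a language understanding module (LU) and a dialogue state tracking module (DST, also called the belief tracking component). LU converts the user's natural-language input into a semantic frame, while DST tracks the dialogue context and the dialogue state. At each turn, DST uses the semantic frame produced by LU to update the dialogue state (DS):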

The DS accumulates the preferences specified by the user over the dialogue and is used to make requests to a backend. The results from the backend and the dialogue state are then used by a dialogue policy module to generate the next system response.

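Cascading dialogue components tends to propagate errors from one module to the next. Recent work has therefore focused on jointly modeling LU and DST, with the aims of reducing computation and correcting errors introduced by the LU module.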

However, combining joint modeling with the ability to scale to multiple domains and handle slots with a large set of possible values, potentially containing entities not seen during training, are active areas of research.

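This paper proposes a multi-task learning model that handles LU and DST jointly. As in Liu and Lane (Bing Liu and Ian Lane. 2017. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Proceedings of Interspeech), an RNN captures the dialogue context, and its hidden state is used for intent classification, dialogue act classification and slot tagging. The slot values identified by the slot tags (Figure 2) are used to update the set of candidate values for each slot. As in Rastogi et al. 2017, a recurrent scoring network with shared weights scores the slot values, which lets DST handle entities not seen in the training set, such as OOV slot values.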

During inference, the model uses its own predicted slot tags and previous turn dialogue state. However, ground truth slot tags and dialogue state are used for training to ensure stability. Aiming to bridge this gap between training and inference, we also propose a novel scheduled sampling (Bengio et al., 2015) approach to joint language understanding and dialogue state tracking.

Model Architecture

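Let a dialogue consist of T turns, each containing a user utterance and the system's dialogue acts. Figure 3 shows the overall architecture, which consists of the following modules: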

a user utterance encoder
a system act encoder
a state encoder
a slot tagger and a candidate scorer

Figure 3: Architecture of our joint LU and DST model as described in Section 3. $x^{t}$ is the sequence of user utterance token embeddings, $a^{t}$ is the system act encoding and blue arrows indicate additional features used by DST as detailed in Section 3.8.

At each turn $t \in \left \{ 1,2,\ldots,T\right \}$, the model takes a dialogue turn and the previous dialogue state $D^{t-1}$ as input and outputs the predicted user intent, user dialogue acts, slot values in the user utterance and the updated dialogue state $D^{t}$.

As a new turn arrives, the system act encoder (Section 3.1) encodes all system dialogue acts in the turn to generate the system dialogue act vector $a^{t}$. Similarly, the utterance encoder (Section 3.2) encodes the user utterance into a vector $u^{t}_{e}$ , and also generates contextual token embeddings $u^{t}_{o}$ for each utterance token. The state encoder (Section 3.3) then uses $a^{t}$, $u_{e}^{t}$ and its previous turn hidden state, $d_{st}^{t-1}$, to generate the dialogue context vector $d_{o}^{t}$ , which summarizes the entire observed dialogue, and its updated hidden state $d_{st}^{t}$ .

The dialogue context vector $d_{o}^{t}$ is then used by the user intent classifier (Section 3.4) and user dialogue act classifier (Section 3.5). The slot tagger (Section 3.6) uses the dialogue context from the previous turn $d_{o}^{t-1}$, the system act vector $a^{t}$ and contextual token embeddings $u^{t}_{o}$ to generate refined contextual token embeddings $s_{o}^{t}$. These refined token embeddings are then used to predict the slot tag for each token in the user utterance.

The system dialogue acts and predicted slot tags are then used to update the set of candidate values for each slot (Section 3.7). The candidate scorer (Section 3.8) then uses the previous dialogue state $D^{t-1}$ , the dialogue context vector $d_{o}^{t}$ and other features extracted from the current turn (indicated by blue arrows in Figure 3) to update the scores for all candidates in the candidate set and outputs the updated dialogue state $D^{t}$ . The following sections describe these components in detail.

System Act Encoder

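Each system dialogue act consists of an act type plus an optional slot and slot value. The System Act Encoder first encodes the input dialogue acts as one-hot vectors, in three categories:

Dialogue acts with neither a slot nor a slot value are encoded as $a_{utt}^{t}$, e.g. greeting, negate
Dialogue acts with only a slot s are encoded as $a_{slot}^{t}(s)$, e.g. request(date)
Dialogue acts with both a slot s and a value c are encoded as $a_{cand}^{t}(s, c)$, e.g. offer(time=7pm)

These one-hot vectors are then combined as follows:

where $e_{s}$ is the slot embedding, learned during training.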
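
To make the three categories concrete, here is a minimal sketch of how the system acts of one turn could be split into binary indicator vectors. The act vocabularies (`UTT_ACTS`, `SLOT_ACTS`, `CAND_ACTS`) and the tuple layout are hypothetical and chosen only for illustration; the step that combines the vectors with the slot embedding is not covered.

```python
# Hypothetical act vocabularies, one per category (not taken from the paper).
UTT_ACTS = ["greeting", "negate"]    # no slot, no value
SLOT_ACTS = ["request"]              # slot only
CAND_ACTS = ["offer", "confirm"]     # slot and value


def encode_system_acts(acts):
    """Split system acts into the three binary encodings.

    `acts` is a list of (act_type, slot, value) tuples; slot and value
    are None when absent.
    """
    a_utt = [0] * len(UTT_ACTS)
    a_slot = {}   # slot -> binary vector over SLOT_ACTS
    a_cand = {}   # (slot, value) -> binary vector over CAND_ACTS
    for act_type, slot, value in acts:
        if slot is None:
            a_utt[UTT_ACTS.index(act_type)] = 1
        elif value is None:
            vec = a_slot.setdefault(slot, [0] * len(SLOT_ACTS))
            vec[SLOT_ACTS.index(act_type)] = 1
        else:
            vec = a_cand.setdefault((slot, value), [0] * len(CAND_ACTS))
            vec[CAND_ACTS.index(act_type)] = 1
    return a_utt, a_slot, a_cand


turn_acts = [
    ("greeting", None, None),
    ("request", "date", None),
    ("offer", "time", "7pm"),
]
a_utt, a_slot, a_cand = encode_system_acts(turn_acts)
```

Keeping the slot-level and candidate-level vectors in dictionaries keyed by slot (and value) mirrors the notation $a_{slot}^{t}(s)$ and $a_{cand}^{t}(s, c)$.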

Utterance Encoder

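The Utterance Encoder produces a representation of the user's token sequence. Its input is the user's natural-language utterance, with SOS and EOS tokens added at the start and end, and its output is the corresponding token embeddings.

The Utterance Encoder is a single-layer bidirectional GRU. Note that its hidden state is initialized directly with a zero vector:
$$u_{e}^{t}, u_{o}^{t}=BRNN_{GRU}(x^{t}) $$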

State Encoder

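The State Encoder is a single-layer unidirectional GRU RNN. At each time step t, it takes $a^{t} \oplus u_{e}^{t}$ and the previous hidden state $d_{st}^{t-1}$ as input, and outputs the updated hidden state $d_{st}^{t}$ and the dialogue context representation $d_{o}^{t}$. For a GRU these two are identical.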

User Intent Classification

The user intent is used to identify the backend with which the dialogue system should interact.

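In this paper a user may switch intents during a dialogue, but each turn has exactly one intent. Intent prediction is therefore treated as a multi-class classification task: at each turn the model predicts a probability distribution over the set of all intents:

$p_{i}^{t}$: the intent probability distribution, of length $\left|I \right|$
$I$: the user intent set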
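
A minimal sketch of such a classifier, assuming a single linear layer over the dialogue context vector $d_{o}^{t}$ followed by softmax. The parameter names `W_i` and `b_i` and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def intent_distribution(d_o, W_i, b_i):
    """Project the dialogue context vector onto |I| intent logits, then normalize.

    W_i has one row per intent; W_i and b_i are assumed trainable parameters.
    """
    logits = [sum(w * x for w, x in zip(row, d_o)) + b for row, b in zip(W_i, b_i)]
    return softmax(logits)


# Toy example: 3-dimensional context vector, 2 intents (values are made up).
d_o = [0.5, -1.0, 2.0]
W_i = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
b_i = [0.0, 0.1]
p_i = intent_distribution(d_o, W_i, b_i)
predicted_intent = max(range(len(p_i)), key=p_i.__getitem__)
```

Because each turn has exactly one intent, the prediction is simply the argmax of the distribution.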

User Dialogue Act Classification

We model user dialogue act classification as a multilabel classification problem, to allow for the presence of more than one dialogue act in a turn.

At each dialogue turn, the model predicts the probability of each act:

$A_{u}$: the dialogue act set
$W_{a}\in \mathbb{R}^{d \times \left|A_{u} \right|}$
$d = dim(d_{o}^{t})$

For each act $\alpha$, $p_{a}^{t}(\alpha)$ is interpreted as the probability of presence of $\alpha$ in turn t. During inference, all dialogue acts with a probability greater than $t_{u}$ are predicted, where $0 < t_{u} < 1.0$ is a hyperparameter tuned using the dev set.
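
The thresholding step can be sketched as follows, assuming each act gets an independent sigmoid probability. The act names and logits are hypothetical, and the default `t_u=0.5` is only a placeholder for the tuned value.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_user_acts(logits, act_names, t_u=0.5):
    """Return every act whose independent sigmoid probability exceeds t_u."""
    probs = {name: sigmoid(z) for name, z in zip(act_names, logits)}
    predicted = [name for name in act_names if probs[name] > t_u]
    return predicted, probs


# Hypothetical act inventory and logits for one turn.
act_names = ["inform", "request", "affirm", "negate"]
logits = [2.0, 0.3, -1.5, -3.0]
acts, probs = predict_user_acts(logits, act_names, t_u=0.5)
strict_acts, _ = predict_user_acts(logits, act_names, t_u=0.6)
```

Unlike softmax, the sigmoid outputs need not sum to one, so several acts (or none) can be predicted in the same turn.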

Slot Tagging

Slot tagging is the task of identifying the presence of values of different slots in the user utterance. We use the IOB tagging scheme (Tjong Kim Sang and Buchholz 2000, see Figure 2) to assign a label to each token. These labels are then used to extract the values for different slots from the utterance.

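The Slot Tagger is a Bi-LSTM. Its input is the concatenation of the token embeddings $u_{o}^{t}$ output by the Utterance Encoder and the system act encoding $a^{t}$, and it produces $s_{o}^{t}=\left \{ s_{o,m}^{t}\in \mathbb{R}^{2d_{s}},0\leq m< M^{t} \right \}$, where $M^{t}$ is the number of tokens in the user utterance. The hidden state is initialized with $d_{o}^{t-1}$ and the cell state with a zero vector. For the m-th token, a softmax classifier over $s_{o,m}^{t}$ gives a probability distribution over $2\left| S\right|+1$ labels, where S is the set of all slots.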
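
As an illustration of how IOB labels turn into slot values, and of where the $2\left| S\right|+1$ label count comes from, here is a small sketch. The slot names and the example utterance are made up, and the handling of a stray I- tag (dropping it) is an assumption.

```python
def extract_slot_values(tokens, tags):
    """Collect (slot, value) pairs from IOB tags such as B-time, I-time, O."""
    values = []
    current_slot, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_slot is not None:
                values.append((current_slot, " ".join(current_tokens)))
            current_slot, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_slot == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_slot is not None:
                values.append((current_slot, " ".join(current_tokens)))
            current_slot, current_tokens = None, []
    if current_slot is not None:
        values.append((current_slot, " ".join(current_tokens)))
    return values


def num_iob_labels(slots):
    """One B- and one I- label per slot, plus the single O label."""
    return 2 * len(slots) + 1


tokens = ["table", "for", "two", "at", "7", "pm", "tomorrow"]
tags = ["O", "O", "B-num_people", "O", "B-time", "I-time", "B-date"]
pairs = extract_slot_values(tokens, tags)
```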

Updating Candidate Set

The candidate set $C_{s}^{t}$ is the set of values of a slot s that have been mentioned by the system or the user up to and including turn t.

Rastogi et al. 2017 (see Ref. 3) proposed the use of candidate sets in DST for efficiently handling slots with a large set of values. In their setup, the candidate set is updated at every turn to include new values and discard old values when it reaches its maximum capacity. The dialogue state is represented as a set of distributions over value set $V_{s}^{t} = C_{s}^{t}\cup \left \{ \delta ,\phi \right \}$ for each slot $s\in S^{t}$, where $\delta$ and $\phi$ are special values dontcare (user is ok with any value for the slot) and null (slot not specified yet) respectively, and $S^{t}$ is the set of all slots that have been mentioned either by the user or the system till turn t.

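The proposed model adopts the same definition and update procedure. At each turn, the candidate sets are updated using the tags predicted by the slot tagger and the system acts that contain slots. To allow parallel computation, all candidate sets are padded to the same length. To tell padding apart from real values, an indicator $m_{v}^{t}(s,c)$ is defined: it is 1 when the slot value is real and 0 when it is padding.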
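
A minimal sketch of the update and padding steps for one slot. The oldest-first eviction rule and the `PAD` symbol are assumptions; the text only states that old values are discarded at maximum capacity and that padded entries are flagged by the mask.

```python
PAD = "<pad>"  # hypothetical padding symbol


def update_candidates(candidates, new_values, max_size):
    """Add newly mentioned values; keep only the most recent max_size entries."""
    updated = list(candidates)
    for value in new_values:
        if value not in updated:
            updated.append(value)
    # Assumption: oldest-first eviction once capacity is exceeded.
    return updated[-max_size:]


def pad_with_mask(candidates, max_size):
    """Pad to a fixed length and return the mask m_v (1 = real value, 0 = padding)."""
    n_pad = max_size - len(candidates)
    padded = list(candidates) + [PAD] * n_pad
    mask = [1] * len(candidates) + [0] * n_pad
    return padded, mask


# Values for the slot "time": one from earlier turns, two mentioned this turn.
cands = update_candidates(["6pm"], ["7pm", "6pm"], max_size=4)
padded, mask = pad_with_mask(cands, max_size=4)
```

With every candidate set padded to the same `max_size`, the scores for all slots can be computed as one batch and the mask removes the padded positions afterwards.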

Candidate Scorer

The candidate scorer predicts the dialogue state by updating the distribution over the value set $V_{s}^{t}$ for each slot $s\in S^{t}$.

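The authors define three intermediate features, corresponding to the three categories of system dialogue acts ($a_{utt}^{t}$, $a_{slot}^{t}(s)$, $a_{cand}^{t}(s, c)$):

$r_{utt}^{t}$: shared across all value candidate sets
$r_{slot}^{t}(s)$: used to update the scores for $V_{s}^{t}$
$r_{cand}^{t}(s, c)$: the features for slot s taking value c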

where:

$p_{\delta }^{t-1}(s)$: the previous-turn score of dontcare
$p_{\phi }^{t-1}(s)$: the previous-turn score of null
$p_{c}^{t-1}(s)$: the previous-turn score of slot s taking value c; it is taken to be 0 if c does not belong to $C_{s}^{t-1}$
$m_{u}^{t}(c)$: 1 if the value c is a substring of the user utterance, 0 otherwise. It indicates which candidate values the user has just mentioned

Given these intermediate features, fully connected layers produce the score distribution over $V_{s}^{t}$:

where:

$l_{s}^{t}(\delta)$: the logit for dontcare
$l_{s}^{t}(\phi)$: the logit for null, a trainable parameter
$l_{s}^{t}(c)$: the logit for a candidate $c\in C_{s}^{t}$

These logits are obtained by processing the corresponding features using feedforward neural networks $FF_{cs}^{1}$ and $FF_{cs}^{2}$ , each having one hidden layer. The output dimension of these networks is 1 and the dimension of the hidden layer is taken to be half of the input dimension. The logits are then normalized using softmax to get the distribution $p_{s}^{t}$ over $V_{s}^{t}$.
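
The final normalization can be sketched as below. The logits are made-up numbers standing in for the feedforward network outputs, and excluding padded candidates by setting their logit to negative infinity is an assumption about how the padding mask is applied.

```python
import math


def value_distribution(l_dontcare, l_null, cand_logits, mask):
    """Softmax over [dontcare, null, candidates], skipping padded candidates."""
    logits = [l_dontcare, l_null] + [
        l if m == 1 else float("-inf") for l, m in zip(cand_logits, mask)
    ]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for one slot: two real candidates and one padded entry.
p = value_distribution(
    l_dontcare=-1.0, l_null=0.0, cand_logits=[2.0, 0.5, 9.9], mask=[1, 1, 0]
)
best = max(range(len(p)), key=p.__getitem__)
```

Index 0 is dontcare, index 1 is null, and the remaining indices follow the order of the padded candidate set, so the padded entry ends up with probability zero regardless of its raw logit.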

Scheduled Sampling

DST is a recurrent model which uses predictions from the previous turn. For stability during training, ground truth predictions from the previous turn are used. This causes a mismatch between training and inference behavior.

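This paper uses scheduled sampling to address this mismatch. The approach has previously been shown to improve slot tagging performance. In essence, scheduled sampling chooses between the ground truth and the model's prediction with some probability during training, deliberately injecting noise.

The accuracy of the slot tagger strongly affects DST, because a slot value that is not tagged never enters the candidate set. To address this, during training the authors sample between the ground truth slot tags $\overline{c}_{u}^{t}$ and the predicted slot tags $c_{u}^{t}$. Let $p_{c}$ be the probability of choosing $\overline{c}_{u}^{t}$; training starts with $p_{c}=1$, which is then gradually reduced. The goal is to deliberately add noise to the candidate set.

The candidate scorer is treated in the same way.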

During inference, the candidate scorer only has access to its own predicted scores in the previous turn (Equations 13 and 14). To better mimic this setup during training, we start with using ground truth previous scores taken from $\overline{D}^{t-1}$ (i.e. with keep probability $p_{D} = 1$) and gradually switch to $D^{t-1}$, the predicted previous scores, reducing $p_{D}$.
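
The sampling decision itself is small. The sketch below assumes one draw per turn for the whole tag sequence (the same helper would apply to the previous dialogue state); the granularity of the draw and the name `sample_input` are assumptions, not details taken from the paper.

```python
import random


def sample_input(ground_truth, predicted, keep_prob, rng=random):
    """Return the ground truth with probability keep_prob, else the prediction."""
    return ground_truth if rng.random() < keep_prob else predicted


rng = random.Random(0)
gold_tags = ["O", "B-time", "I-time"]
pred_tags = ["O", "B-time", "O"]

# keep_prob = 1.0 reproduces plain teacher forcing; 0.0 matches inference.
always_gold = sample_input(gold_tags, pred_tags, keep_prob=1.0, rng=rng)
always_pred = sample_input(gold_tags, pred_tags, keep_prob=0.0, rng=rng)
```

Intermediate keep probabilities expose the model to its own mistakes during training, which is the noise injection described above.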

Experiments

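The paper's main contributions are:

Joint modeling of LU and DST, which reduces computation without degrading performance;
Scheduled sampling, which makes DST more robust at inference time.

To validate these two contributions, the authors run two sets of experiments: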

Separate vs Joint LU-DST: Figure 3 shows the joint LU-DST setup where parameters in the utterance encoder and state encoder are shared across LU tasks (intent classification, dialogue act classification and slot tagging) and DST (candidate scoring). As baselines, we also conduct experiments where LU and DST tasks use separate parameters for utterance and state encoders.
Scheduled Sampling: four configurations, depending on whether scheduled sampling is applied to the slot tagger and to the candidate scorer
- None : Ground truth slot tags ($\overline{c}_{u}^{t}$) and previous dialogue state ($\overline{D}^{t-1}$) are used for training.
- Tags : Model samples between ground truth ($\overline{c}_{u}^{t}$) and predicted ($c_{u}^{t}$) slot tags, sticking to ground truth previous state.
- State : Model samples between ground truth ($\overline{D}^{t-1}$) and predicted ($D^{t-1}$) previous state, sticking to ground truth slot tags.
- Both : Model samples between $\overline{D}^{t-1}$ and $D^{t-1}$ as well as between $\overline{c}_{u}^{t}$ and $c_{u}^{t}$.

The authors set $k_{pre} = 0.3k_{max}$.
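
One plausible reading of this setting, sketched below, is that the keep probability stays at 1 for the first $k_{pre}$ training steps and then decays linearly to a minimum keep rate at step $k_{max}$. The exact shape of the schedule is an assumption here, and the function and parameter names are hypothetical.

```python
def keep_probability(step, k_max, p_min, pre_fraction=0.3):
    """Keep probability at a training step under an assumed piecewise-linear schedule.

    Stays at 1.0 until k_pre = pre_fraction * k_max, then decays linearly
    and reaches p_min at k_max, where it remains.
    """
    k_pre = pre_fraction * k_max
    if step <= k_pre:
        return 1.0
    if step >= k_max:
        return p_min
    progress = (step - k_pre) / (k_max - k_pre)
    return 1.0 - progress * (1.0 - p_min)


# Example with 100k steps and a minimum keep rate of 0.5 (illustrative values).
schedule = [keep_probability(k, k_max=100_000, p_min=0.5) for k in (0, 30_000, 65_000, 100_000)]
```

Under these assumptions the model trains on ground truth only for the first 30% of the schedule, then sees progressively more of its own predictions.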

Evaluation Metrics

The paper uses five evaluation metrics:

user intent classification accuracy
F1 score for user dialogue act classification
frame accuracy for slot tagging: a turn counts as correct only if all of its slots are predicted correctly
joint goal accuracy: the fraction of turns for which the predicted and ground truth dialogue state match for all slots

The output of DST is a distribution of probabilities for candidate values of each slot. To calculate the slot assignments, the value(s) with probability above a threshold (tuned using dev set) are chosen. We use joint goal accuracy as the metric for evaluation. This metric compares the predicted slot assignments to the ground truth at every dialogue turn, and the output is considered correct only if all the predicted slot values exactly match the ground truth values. (see Ref. 3)
slot F1 score for DST: when evaluating DST, the authors use only the slot values and dialogue state predicted in the previous turn, not the ground truth
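
Joint goal accuracy is the strictest of these metrics. A minimal sketch, assuming each dialogue state is represented as a dictionary from slot to its assigned value (the slot names and values are made up):

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state matches the gold state on every slot."""
    if not gold_states:
        return 0.0
    correct = sum(
        1 for pred, gold in zip(predicted_states, gold_states) if pred == gold
    )
    return correct / len(gold_states)


gold = [
    {"time": "7pm"},
    {"time": "7pm", "date": "tomorrow"},
    {"time": "8pm", "date": "tomorrow"},
]
pred = [
    {"time": "7pm"},
    {"time": "7pm", "date": "today"},      # one wrong slot fails the whole turn
    {"time": "8pm", "date": "tomorrow"},
]
acc = joint_goal_accuracy(pred, gold)
```

A single wrong slot value makes the whole turn count as incorrect, which is why this metric is typically much lower than per-slot scores.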

Datasets

Simulated Dialogues
DSTC2

Training

We use sigmoid cross entropy loss for dialogue act classification and softmax cross entropy loss for all other tasks. During training, we minimize the sum of all task losses using ADAM optimizer (Kingma and Ba, 2014), for 100k training steps with batches of 10 dialogues each. We used grid-search to identify the best hyperparameter values (sampled within specified range) for learning rate (0.0001 - 0.005) and token embedding dimension (50 - 200). For scheduled sampling experiments, the minimum keep rate, i.e. $p_{min}$, is varied between 0.1 - 0.9 with linear decay. The layer sizes for the utterance encoder and slot tagger are set equal to the token embedding dimension, and that of the state encoder to half this dimension.
Slot Value dropout - To make the model robust to OOV tokens arising from new entities not present in the training set, we randomly replace slot value tokens in the user utterance with a special OOV token with a probability that linearly increases from 0.0 to 0.4 during training.
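
Slot value dropout can be sketched as follows, assuming replacement is decided independently per token and that slot value tokens are identified by a non-O tag. The `OOV` symbol and both function names are hypothetical.

```python
import random

OOV = "<oov>"  # hypothetical OOV symbol


def dropout_rate(step, total_steps, max_rate=0.4):
    """Linearly increase the replacement probability from 0.0 to max_rate."""
    return max_rate * min(step / total_steps, 1.0)


def slot_value_dropout(tokens, tags, rate, rng=random):
    """Replace tokens inside slot values (non-O tags) with OOV at the given rate."""
    return [
        OOV if tag != "O" and rng.random() < rate else token
        for token, tag in zip(tokens, tags)
    ]


tokens = ["book", "at", "7", "pm"]
tags = ["O", "O", "B-time", "I-time"]
rng = random.Random(0)
halfway = dropout_rate(50_000, 100_000)
noisy = slot_value_dropout(tokens, tags, rate=halfway, rng=rng)
```

Tokens outside slot values are never replaced, so the surrounding context that signals a slot stays intact while the value itself is hidden.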

Results and Discussion

The joint LU-DST model with scheduled sampling (SS) on both slot tags and dialogue state performs the best.
SS on slot tags helps the most with Sim-R and DSTC2: our two datasets with the most data, and low OOV rates, while SS on both slot tags and dialogue state helps more on the smaller Sim-M.
Slot value dropout (Section 5.3) improves LU as well as DST results consistently.

Conclusions

In this work, we present a joint model for language understanding (LU) and dialogue state tracking (DST), which is computationally efficient by way of sharing feature extraction layers between LU and DST, while achieving an accuracy comparable to modeling them separately across multiple tasks. We also demonstrate the effectiveness of scheduled sampling on LU outputs and previous dialogue state as an effective way to simulate inference-time conditions during training for DST, and make the model more robust to errors.